MINING DISCRIMINATIVE ITEMS IN MULTIPLE DATA STREAMS by Zhenhua
Abstract
How can we maintain a dynamic profile capturing a user's reading interest against the common interest? What are the queries that have been asked 1,000 times more frequently to a search engine from users in Asia than in North America? What are the keywords (or tags) that are 1,000 times more frequent in the blog stream on computer games than in the blog stream on Hollywood movies? To answer such questions, we need to find discriminative items in multiple data streams. Each data source, such as Web search queries in a region or blog postings on a topic, can be modeled as a data stream because of its fast-growing volume. Motivated by these extensive applications, in this thesis we study the problem of mining discriminative items in multiple data streams. We show that, to exactly find all discriminative items in stream S1 against stream S2 in one scan, the space lower bound is Ω(|Σ| log(n1/|Σ|)), where Σ is the alphabet of items and n1 is the current size of S1. To tackle the space challenge, we develop three heuristic algorithms that can achieve high precision and recall using sub-linear space and sub-linear processing time per item with respect to |Σ|. The complexities of all the algorithms are independent of the sizes of the two streams. An extensive empirical study using both real and synthetic data sets verifies our design.
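To make the problem statement concrete, the following is a minimal sketch of the exact baseline that the space lower bound refers to, not one of the thesis's three heuristic algorithms. It assumes an item counts as discriminative in S1 against S2 when its relative frequency in S1 is at least theta times its relative frequency in S2; the abstract only implies such a ratio test ("1,000 times more frequent"), so the precise definition, the class name `ExactDiscriminativeCounter`, and the parameter `theta` are all hypothetical. It also assumes that an item never seen in S2 qualifies.

```python
from collections import Counter


class ExactDiscriminativeCounter:
    """Exact one-scan baseline: one counter per distinct item per stream.

    Space grows with the number of distinct items, which is the cost
    the sub-linear heuristics are meant to avoid.
    """

    def __init__(self, theta):
        self.theta = theta                    # assumed ratio threshold, e.g. 1000
        self.counts = (Counter(), Counter())  # item counts for S1 and S2
        self.sizes = [0, 0]                   # current sizes n1 and n2

    def observe(self, stream, item):
        """Record one arrival; stream is 0 for S1 and 1 for S2."""
        self.counts[stream][item] += 1
        self.sizes[stream] += 1

    def discriminative_items(self):
        """Return items whose rate in S1 is >= theta times their rate in S2."""
        n1, n2 = self.sizes
        result = set()
        for item, f1 in self.counts[0].items():
            f2 = self.counts[1][item]  # 0 if the item never appeared in S2
            # f1/n1 >= theta * f2/n2, cross-multiplied to avoid dividing by zero.
            if f1 * n2 >= self.theta * f2 * n1:
                result.add(item)
        return result


if __name__ == "__main__":
    counter = ExactDiscriminativeCounter(theta=2)
    for item in ["a", "a", "a", "b"]:
        counter.observe(0, item)
    for item in ["a", "b", "b", "b"]:
        counter.observe(1, item)
    print(counter.discriminative_items())
```

In the usage example, item "a" has rate 3/4 in S1 and 1/4 in S2, a ratio of 3, so it passes a threshold of 2, while "b" does not.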
Similar resources
Incremental Mining of Across-streams Sequential Patterns in Multiple Data Streams
Sequential pattern mining is the mining of data sequences for frequent, time-ordered sequential patterns, and it has wide application. Data streams are streams of data that arrive at high speed. Due to limited memory capacity and the need for real-time mining, the mining results need to be updated in real time. Multiple data streams are the simultaneous arrival of a plurality ...
Mining Noisy Data Streams via a Discriminative Model
The two main challenges typically associated with mining data streams are concept drift and data contamination. To address these challenges, we seek learning techniques and models that are robust to noise and can adapt to changes in a timely fashion. In this paper, we approach the stream-mining problem using a statistical estimation framework, and propose a discriminative model for fast mining of...
Stream Data Mining: A Survey
A data stream is a massive, continuous and rapid sequence of data elements. Mining data streams raises new problems for the data mining community about how to mine continuous high-speed data items that you can only have one look at. For this reason, the traditional data mining approach is replaced by systems with some special characteristics, such as continuous arrival in multiple, rapid, time-var...
Online Mining Changes of Items over Continuous Append-only and Dynamic Data Streams
Online mining of changes over data streams has been recognized as an important task in data mining. Mining changes over data streams is both compelling and challenging. In this paper, we propose a new, single-pass algorithm, called MFC-append (Mining Frequency Changes of append-only data streams), for discovering the frequent frequency-changed items, vibrated frequency changed items, and stable...
Cost-Efficient Mining Techniques for Data Streams
A data stream is a continuous and high-speed flow of data items. High speed refers to the phenomenon that the data rate is high relative to the computational power. The increasing focus on applications that generate and receive data streams stimulates the need for online data stream analysis tools. Mining data streams is a real-time process of extracting interesting patterns from high-speed dat...